---
Authors:
  - Sonja Häffner
  - Martin Hofer
  - Maximilian Nagl
  - Julian Walterskirchen
---

# README for replicating "Introducing an Interpretable Deep Learning Approach to Domain-Specific Dictionary Creation: A Use Case for Conflict Prediction"

## Paper Abstract

Recent advancements in natural language processing (NLP) methods have significantly improved their performance. However, more complex NLP models are more difficult to interpret and computationally expensive. Therefore, we propose an approach to dictionary creation that carefully balances the trade-off between complexity and interpretability. This approach combines a deep neural network architecture with techniques to improve model explainability to automatically build a domain-specific dictionary. As an illustrative use case of our approach, we create an objective dictionary that can infer conflict intensity from text data. We train the neural networks on a corpus of conflict reports and match them with conflict event data. This corpus consists of over 14,000 expert-written International Crisis Group (ICG) CrisisWatch reports between 2003 and 2021. Sensitivity analysis is used to extract the weighted words from the neural network to build the dictionary. In order to evaluate our approach, we compare our results to state-of-the-art deep learning language models, text-scaling methods, as well as standard, non-specialized, and conflict event dictionary approaches. We are able to show that our approach outperforms other approaches while retaining interpretability.

## Overview

The code in this repository replicates the analysis for "Introducing an Interpretable Deep Learning Approach to Domain-Specific Dictionary Creation: A Use Case for Conflict Prediction" using Python and R. There is one main script ('1_main.py') that runs most of the code to generate the data for the figures and tables in the paper (and the Appendix). The replicator should expect the code to run for about 40 hours. Replicating the fine-tuning of the BERT model is not recommended on a regular computer due to the high computational requirements.

Python version 3.9.15 <https://www.python.org/downloads/release/python-3915/>, Anaconda version 2021.11 (Build Channel py39_0, <https://anaconda.org/anaconda/anaconda/2021.11/download/win-64/anaconda-2021.11-py39_0.tar.bz2>), R version 4.1.2 <https://cran.r-project.org/bin/windows/base/old/4.1.2/>, and Rtools 4.0 <https://cran.r-project.org/bin/windows/Rtools/rtools40.html> are required. Conda 4.11.0 <https://anaconda.org/anaconda/conda/files?version=4.11.0&page=2> is recommended.

## Data Availability

Data sources used:

-   International Crisis Group CrisisWatch reports between 2003 and 2021 (<https://www.crisisgroup.org/crisiswatch>). The Copyright and Trademark notice can be found in `crisisgroup_copyright.txt` and online at <https://www.crisisgroup.org/legal/copyright-and-trademark-notice>.

  Datafile: `Crisiswatch_2003_2021.pkl`
	

-   Uppsala Conflict Data Program (UCDP) Georeferenced Event Dataset (GED) Global version 22.1 (<https://ucdp.uu.se/downloads/ged/ged221-csv.zip>). The UCDP data can be accessed via the UCDP website (<https://ucdp.uu.se/downloads/>), where one has to navigate to the appropriate data source. For this paper we used the Georeferenced Event Dataset (GED), which is listed in the "Disaggregated Datasets" section. We downloaded version 22.1 as a CSV file.

  Datafile: `GEDEvent_v22_1.csv`

### Statement about Rights

-   [ ] I certify that the author(s) of the manuscript have legitimate access to and permission to use the data used in this manuscript.
-   [x] I certify that the author(s) of the manuscript have documented permission to redistribute/publish the data contained within this replication package.

### Summary of Availability

-   [x] All data **are** publicly available.
-   [ ] Some data **cannot be made** publicly available.
-   [ ] **No data can be made** publicly available.

### Details on Secondary Data Sources

Summary of **secondary data sources**:

| Data.Name                     | Data.Files                | Location | Provided | Citation                                           |
|---------------|---------------|---------------|---------------|---------------|
| CrisisWatch reports 2003-2021 | Crisiswatch_2003_2021.pkl | data/raw/    | TRUE    | ICG (2022)                                         |
| UCDP GED 22.1                 | GEDEvent_v22_1.csv        | data/raw/    | TRUE     | Sundberg and Melander (2013); Davies et al. (2022) |
| PETRARCH2                     | data/raw/dictionaries/..  |              | TRUE     | Norris et al. (2017) |
| CAMEO codes                   | cameo_main_categores.csv  |  data/raw/   | TRUE     | Norris et al. (2017) |

## Dataset list

| Data file                  | Source          | Notes       | Provided |
|--------------------|------------------|------------------|------------------|
|**raw**| | | |
| `data/raw/Crisiswatch_2003_2021.pkl`  | ICG   | As per terms of use  | Yes |
| `data/raw/GEDEvent_v22_1.csv`         | UCDP  | As per terms of use    | Yes |
| `data/raw/dictionaries/..`            | PETRARCH2  | PETRARCH event and actor dictionaries  | Yes |
| `data/raw/cameo_main_categores.csv`   | CAMEO | Event aggregation scheme | Yes  |
| `data/raw/stopwords_`                 | manual coding by the authors       | Used to remove references to locations and entities from our dictionary      | Yes |
|**analysis**| | | |
| `data/analysis/bert_full.csv`           | `final_bert.py` | BERT results on the full dataset and actual observed fatalities  | Yes |
| `data/analysis/cw_bert_pred_zero.csv`       | `final_bert.py` | BERT results for the full set | Yes |
| `data/analysis/cw_pred_test_zero.csv`       | `final_bert.py` | BERT results for the test set | Yes |                                                        
| `data/analysis/fi_scores_103.csv`               | `5_neuralnetwork.py`       | Feature importance scores calculated by the NN                                          | Yes      |
| `data/analysis/lasso_dictionary.csv`            | `3_lasso.py`               | Dictionary words obtained from the Lasso model                                          | Yes      |
| `data/analysis/nn_parameters.csv`               | `5_neuralnetwork.py`       | Parameters obtained from the NNs                                                        | Yes      |
| `data/analysis/nn_results.csv`                  | `5_neuralnetwork.py`       | NN prediction results                                                                   | Yes      |
| `data/analysis/results_hyperparameters.csv`     | `5_neuralnetwork.py`       | Hyperparameter selection for NNs                                                        | Yes      |
| `data/analysis/wordfish_scores.csv`             | `4_wordfish_wordscore.R`   | Wordfish scores for the full dataset                                                    | Yes      |
| `data/analysis/wordscore_scores.csv`            | `4_wordfish_wordscore.R`   | Wordscore scores for the full dataset                                                   | Yes      |
| `data/analysis/cw_texts_clean.csv`              | `2a_preprocessing_cw.py`   | Preprocessed CrisisWatch reports for all models with the exception of PETRARCH and BERT | Yes      |
| `data/analysis/cw_texts_clean_bert.csv`         | `2b_preprocessing_bert.py` | Preprocessed CrisisWatch reports for the BERT model                                     | Yes      |
| `data/analysis/cw_texts_clean_petrarch.csv`     | `2b_preprocessing_bert.py` | Preprocessed CrisisWatch reports for PETRARCH2                                          | Yes      |
| `data/analysis/cw_texts_final.csv`              | `5_neuralnetwork.py`       | CrisisWatch reports and aggregated FI Scores                                            | Yes      |
| `data/analysis/results_cw_petrarch_coded.csv`   | `6_petrarch.py`            | PETRARCH coding of CrisisWatch texts, sentence level                                    | Yes      |
| `data/analysis/results_petrarch.csv`            | `6_petrarch.py`            | PETRARCH coding of CrisisWatch texts, aggregated                                        | Yes      |
|**output**| | | |
| `data/output/evaluation_models.csv`             | `7_evaluation_models.py`         | Random Forest and XGBoost Model performance metrics                                     | Yes      |
| `data/output/full_dataset.csv`                  | `7_evaluation_models.py`         | Random Forest and XGBoost results for the full dataset                                  | Yes      |
| `data/output/predictions_testset.csv`           | `7_evaluation_models.py`         | Random Forest and XGBoost predictions for all models on the test set                    | Yes      |
| `data/output/bert_metrics.csv`                  | `final_bert.py`            | Evaluation metrics for BERT                                                             | Yes      |
| `data/output/fi_scores_103.csv`               | `5_neuralnetwork.py`       | Feature importance scores calculated by the NN                                          | Yes      |

Since the BERT model has to be run separately, there is also a dedicated folder for all BERT related input and output. Hence, some of the files are duplicated in this separate file structure:

| Data file                  | Source          | Notes       | Provided |
|--------------------|------------------|------------------|------------------|
|**bert**| | | |
| `bert/cw_texts_clean_bert.csv`  | `2b_preprocessing_bert.py`   | Preprocessed CrisisWatch reports for the BERT model  | Yes  |
| `bert/bert_metrics.csv`         | `final_bert.py` | Evaluation metrics for BERT | Yes |
| `bert/cw_bert_pred_zero.csv`       | `final_bert.py` | BERT results for the full set | Yes |
| `bert/cw_pred_test_zero.csv`       | `final_bert.py` | BERT results for the test set | Yes |
|**bert/output/**| | | |
| `bert/output/..`       | `final_bert.py` | BERT configuration and system files | Yes |
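
Before starting a long run, it can help to confirm that the provided input files are where the scripts expect them. The following is a minimal sketch and not part of the replication code: the helper `missing_files` and the list `EXPECTED_RAW` are illustrative names, and the list only covers the two raw data files named in the tables above.

```python
from pathlib import Path

# Raw inputs named in the dataset list (relative to the repository root).
EXPECTED_RAW = [
    "data/raw/Crisiswatch_2003_2021.pkl",
    "data/raw/GEDEvent_v22_1.csv",
]


def missing_files(root, relative_paths):
    """Return the entries of relative_paths that do not exist below root."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).exists()]


if __name__ == "__main__":
    # Assumes the script is started from the repository root.
    missing = missing_files(".", EXPECTED_RAW)
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All listed raw input files are present.")
```

The list can be extended with any other file from the tables that a given script depends on.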


## Computational requirements

To run the replication one needs to set up the correct [Python](./replication_env.yaml) environment. After installing Python 3.9.15, one can create the environment from the `replication_env.yaml` file. With conda, the environment can be created via:

    conda env create -f \PATH\replication_env.yaml

To run the replication one also needs to set up R version 4.1.2 (2021-11-01). The required [R](./code/packages.R) packages are installed and loaded by `packages.R`, which is run from the `main.py` file.

Because environments created from a YAML file can cause cross-platform issues, we also provide a guide for setting up the environment on Linux.
On **Linux** operating systems one first has to set up the correct Python, Anaconda, R, and Rtools versions (see **Overview**). Following this, one has to:

1. Set up a new environment using `conda create -n "myname" python=3.9.15`
2. Install numpy and tensorflow via requirements.txt using `conda install --file requirements.txt` (numpy and tensorflow can otherwise create version conflicts)
3. Install the remaining required packages from `requirements2.txt`, using `python3 -m pip install -r requirements2.txt` on Linux or `pip install -r requirements2.txt` on Windows

### Software Requirements

-   **Python 3.9.15**
Below the most important python packages are listed:
```
numpy==1.22.3
pandas==1.4.3
pycountry_convert==0.7.2
nltk==3.7
re==2.2.1
openpyxl==3.0.9
scikit-learn==1.0.2
joblib==1.1.0
country_converter==0.7.4
gensim==4.2.0
spacy==3.3.1
pyjanitor==0.24.0
regex==2022.3.15
tensorflow==2.6.0
keras==2.6.0
autograd==1.3
xgboost==1.5.1
pysentiment2==0.1.1
tqdm==4.64.0
protobuf==3.20.1
stanza==1.4.2
```

-   The file "`replication_env.yaml`" gives the full list of dependencies and version numbers. Please run "`conda env create -f \PATH\replication_env.yaml`" in your Anaconda prompt as the first step (change `\PATH\` according to your system).

-   **R 4.1.2**
The file "`packages.R`" will install all R packages, and will be run via the `main.py` script.

```
    library(data.table) # version 1.14.4
    library(ggplot2) # version 3.4.0
    library(corrplot) # version 0.92
    library(dplyr) # version 1.0.10
    library(Hmisc) # version 4.7-1
    library(ggthemes) # version 4.2.4
    library(quanteda) # version 3.2.1
    library(quanteda.textmodels) # version 0.9.4
```
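
To check that the active Python environment matches the pinned versions listed above, something like the following can be run inside the environment. It is a sketch under assumptions: `PINNED` copies only a few of the pins for illustration, and `check_versions` is a hypothetical helper, not part of the replication code. It relies on `importlib.metadata`, which looks up installed distributions by name.

```python
from importlib import metadata

# A few of the pins from the list above; extend as needed.
PINNED = {
    "numpy": "1.22.3",
    "pandas": "1.4.3",
    "scikit-learn": "1.0.2",
    "tensorflow": "2.6.0",
    "xgboost": "1.5.1",
    "stanza": "1.4.2",
}


def check_versions(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatch.

    found is None when the package is not installed at all.
    """
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty output means every listed package is installed in the pinned version.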



### Memory and Runtime Requirements

Most code takes a couple of minutes to run (details about the run time of each script element can be found in `main.py`). Note that `5_neuralnetwork.py` takes over 1 day to run, `6_petrarch.py` takes over 10 hours, and `final_bert.py` should be run on a GPU cluster.

Approximate time needed to reproduce the analyses on a standard (2022) laptop (not including BERT models):

-   [x] 40 hours

The code was last run on an **Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz laptop with Windows 10 Pro x64, v.19044**.

## Description of code

![Step-by-step Code and Data Guide](code_guide.png "Code Guide")

-   The file `code/main.py` will run the whole analysis, excluding the fine-tuning of BERT (see below). Running this takes a considerable amount of time. 
-   All code used for our manuscript is located in the `code/` folder:
    -   `2a_preprocessing_cw.py` prepares the CrisisWatch texts for the analysis.
    -   `2b_preprocessing_bert.py` prepares the CrisisWatch texts for the BERT model and PETRARCH2.
    -   `packages.R` installs the necessary R packages.
    -   `3_lasso.py` runs the Lasso model and creates a dictionary based on it.
    -   `4_wordfish_wordscore.R` runs the wordfish and wordscore models on the CrisisWatch texts.
    -   `5_neuralnetwork.py` runs our Neural Network models and produces our objective conflict dictionary. (This part alone takes around 25-28 hours)
    -   `6_petrarch.py` runs PETRARCH2 on CrisisWatch texts. This script imports `petrarch2_all.py` as a module. (This part takes over 10 hours)
    -   `final_bert.py` runs the fine-tuning of our BERT model. This part cannot be run from `main.py`. For further instructions on the BERT model see below.
    -   `7_evaluation_models.py` runs the Random Forest and XGBoost models that were used to evaluate the features created by all other models.
    -   `8_visualizations.R` creates all data based Figures (Figures 2a, 2b, 6, 7, 8, 9 of the main article and Figure 1 of the Appendix).



## Instructions to Replicators

-   Run `conda env create -f \PATH\replication_env.yaml` in your Anaconda prompt. Adjust the path according to where the .yaml file is located. Select the environment in the Python interpreter.
-   In `code/main.py`:
    -   Adjust `command = 'PATH/R/R-4.1.2/bin/Rscript.exe'` in line 36 so that it points to the `Rscript.exe` of the correct R version (on Windows normally in C:/Program Files/).
-   In `6_petrarch.py`, select between `read_dictionaries_windows()` and `read_dictionaries_linux()` in lines 84-86 depending on whether one is using Windows or Linux.
-   Run `code/main.py` until line 48 `file_6 = __import__('6_petrarch')`.
    - If one does not want to independently replicate the BERT results, one can run the whole `main.py` (the files produced by `final_bert.py` are provided in `data/analysis/`)
-   Run `final_bert.py` in a docker container (details below). This can be run anytime after `2b_preprocessing_bert.py` has been executed.
-   Run the remaining parts of `code/main.py`.
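
The `command` variable mentioned in the steps above has to point at an `Rscript` executable. As a rough illustration of how such a path can be resolved and used, one could fall back to whatever `Rscript` is on the `PATH` when no explicit path is given. This is not the code in `main.py`; `find_rscript`, `build_r_command`, and `run_r_script` are hypothetical helpers.

```python
import shutil
import subprocess


def find_rscript(explicit_path=None):
    """Return explicit_path if given, else the first Rscript found on PATH."""
    if explicit_path:
        return str(explicit_path)
    found = shutil.which("Rscript")
    if found is None:
        raise FileNotFoundError("Rscript not found; pass its full path explicitly.")
    return found


def build_r_command(script, rscript=None):
    """Build the argument list for running an R script with Rscript."""
    return [find_rscript(rscript), str(script)]


def run_r_script(script, rscript=None):
    """Run the R script; raises CalledProcessError if R exits with an error."""
    subprocess.run(build_r_command(script, rscript), check=True)
```

Note that the `PATH` lookup returns whichever R installation comes first, which is not necessarily version 4.1.2, so passing the explicit path remains the safer option for replication.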

## Instructions to Replicators using **Linux**

-   Set up a new environment using `conda create -n "myname" python=3.9.15`
-   Install numpy and tensorflow via requirements.txt using `conda install --file requirements.txt` (numpy and tensorflow can otherwise create version conflicts)
-   Install the remaining required packages from `requirements2.txt`, using `python3 -m pip install -r requirements2.txt` on Linux or `pip install -r requirements2.txt` on Windows
-   In `code/main.py`:
    -   Adjust `command = 'PATH/R/R-4.1.2/bin/Rscript.exe'` in line 36 so that it points to the `Rscript.exe` of the correct R version (on Windows normally in C:/Program Files/).
-   In `6_petrarch.py`, select between `read_dictionaries_windows()` and `read_dictionaries_linux()` in lines 84-86 depending on whether one is using Windows or Linux.
-   Run `code/main.py` until line 48 `file_6 = __import__('6_petrarch')`.
    - If one does not want to independently replicate the BERT results, one can run the whole `main.py` (the files produced by `final_bert.py` are provided in `data/analysis/`)
-   Run `final_bert.py` in a docker container (details below). This can be run anytime after `2b_preprocessing_bert.py` has been executed.
-   Run the remaining parts of `code/main.py`.
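
The choice between the two dictionary readers in `6_petrarch.py` depends only on the operating system. A minimal sketch of making that choice automatically is shown here. The helper `pick_dictionary_reader` is hypothetical, and the reader names are returned as strings because the actual functions live in `6_petrarch.py` and are not reproduced.

```python
import platform


def pick_dictionary_reader(system_name=None):
    """Return the name of the reader function matching the operating system."""
    # platform.system() returns e.g. "Windows", "Linux" or "Darwin".
    system_name = system_name or platform.system()
    if system_name == "Windows":
        return "read_dictionaries_windows"
    if system_name == "Linux":
        return "read_dictionaries_linux"
    raise RuntimeError(f"No dictionary reader documented for {system_name!r}")
```

Other systems raise an error on purpose, since this README only documents the Windows and Linux variants.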


### General details

- These programs were last run in December 2022.
- The order matters; follow the **Instructions to Replicators**.
- `code/main.py` will run **most** code in sequence, which should take about 40 hours.
- `code/final_bert.py` needs to be run separately. `final_bert.py` requires the data created by `2b_preprocessing_bert.py`. The fine-tuning of BERT models requires considerable computational power.

    
### Details BERT

- Download the Docker image <nvcr.io/nvidia/pytorch:21.12-py3>, for example via `docker pull nvcr.io/nvidia/pytorch:21.12-py3`
- Run the Docker container (see instructions on <https://docs.docker.com/engine/reference/commandline/run/>). It is important to set the dataframe path in `final_bert.py` according to whether the script is run locally or inside the Docker container.
- Install required additional packages:
  - `pip install transformers` (installs huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1)
  - `pip install datasets` (installs aiohttp-3.8.3 aiosignal-1.3.1 async-timeout-4.0.2 datasets-2.7.1 dill-0.3.6 frozenlist-1.3.3 multidict-6.0.3 multiprocess-0.70.14 pyarrow-10.0.1 responses-0.18.0 xxhash-3.1.0 yarl-1.8.2)
- Run code `python /ws/final_bert.py`
- NVIDIA-SMI 470.103.01; Driver version 470.103.01; CUDA version 11.5
- Python 3.8.12 | packaged by conda-forge
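
The dataframe path mentioned above differs between a local run and a container run. One way to handle this is sketched here under two assumptions that the source does not confirm: that the working directory is mounted at `/ws` inside the container (as the run command suggests) and that the local copy of the input lives in `bert/`. `resolve_data_path` is a hypothetical helper, not a function in `final_bert.py`.

```python
from pathlib import Path


def resolve_data_path(filename, docker_root=Path("/ws"), local_root=Path("bert")):
    """Use the container directory if it exists, otherwise the local one."""
    docker_root, local_root = Path(docker_root), Path(local_root)
    root = docker_root if docker_root.is_dir() else local_root
    return root / filename


if __name__ == "__main__":
    # Input file for the BERT model, as named in the dataset list.
    print(resolve_data_path("cw_texts_clean_bert.csv"))
```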

### Details PETRARCH

We had to process several thousand country reports from the International Crisis Group.
All reports are in English and consist of several sentences. Because we expected a single report to describe several events,
we processed each sentence of a report separately and tried to extract an event from it.
No effort was made to update the dictionaries or to look for events in the context of several sentences.
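
The sentence-by-sentence procedure can be sketched as follows. Everything in this block is illustrative: `code_report` is a hypothetical helper, `naive_split` is a simple regex splitter standing in for a proper sentence segmenter, and `toy_coder` stands in for the PETRARCH2 coder. None of these names come from `6_petrarch.py`.

```python
import re


def naive_split(text):
    """Split text after '.', '!' or '?' followed by whitespace (rough stand-in)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def code_report(report_text, split_sentences, code_sentence):
    """Code each sentence on its own; keep (sentence, event) for coded sentences."""
    events = []
    for sentence in split_sentences(report_text):
        event = code_sentence(sentence)  # None means no event was found
        if event is not None:
            events.append((sentence, event))
    return events


def toy_coder(sentence):
    """Toy stand-in for an event coder: flags sentences containing 'attacked'."""
    return "attack" if "attacked" in sentence else None


if __name__ == "__main__":
    report = "Rebels attacked a convoy. Talks resumed in the capital."
    print(code_report(report, naive_split, toy_coder))
```

Because each sentence is coded in isolation, an event that only becomes clear across two sentences is not picked up, which matches the limitation stated above.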

Instead of using the pip package `corenlp`, a Python wrapper for the Stanford CoreNLP Java library that was last updated on Dec 1, 2015,
we used the newer and well-maintained pip package `stanza`, the Stanford NLP Python library by the same provider.

#### Adaptation of the PETRARCH2 for "Introducing an Interpretable Deep Learning Approach to Domain-Specific Dictionary Creation: A Use Case for Conflict Prediction"

All ideas and code used here are due to Philip Schrodt and his collaborators (Norris et al., 2017).
The code was fetched from the GitHub repository [petrarch2](https://github.com/openeventdata/petrarch2/), and we used the branch `master`
with the last commit from Mar 18, 2019 (commit: 676801c).

Adaptations we made included:
- Changes to make the code compatible with Python 3.9
- Deletion of parts of the code that were not necessary for our application
- Deletion of several comments that were outdated or had become incorrect due to our changes and tests
- Collection of all essential code we used from the original repository in one script

All changes were purely technical, and the outcome should be largely invariant to them in our use case.

The software package PETRARCH2 is issued under the MIT License, and as we used 'substantial portions of the Software', the following
copyright and permission notice applies:

---
Copyright (c) 2014 Open Event Data

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

---



## List of tables and programs

The provided code reproduces:

-   [ ] All numbers provided in text in the paper
-   [ ] All tables and figures in the paper
-   [x] Selected tables and figures in the paper, as explained and justified below.

| Figure/Table      | Code             | Line Number | Output file                    | Note |
|---------------|---------------|---------------|---------------|---------------|
| Figure 1          | n.a. (no data)   |             | Figure1_Process.pdf            |      |
| Figure 2a         | Visualizations.R | 209         | Figure2a_ReportsHist.pdf       |      |
| Figure 2b         | Visualizations.R | 235         | Figure2b_FatalitiesHist.pdf    |      |
| Figure 3          | n.a. (no data)   |             | Figure3_NNarchitecture.pdf     |      |
| Figure 4          | n.a. (no data)   |             | Figure4_text.pdf               | Source: ICG CrisisWatch reports     |
| Figure 5          | n.a. (no data)   |             | Figure5_text2.pdf              |      |
| Table 1           | 5_neuralnetwork.py| 855        | fi_scores_103.csv              | Table manually put together from csv output      |
| Figure 6          | Visualizations.R | 128         | Figure6_FI_Distribution.pdf    |      |
| Figure 7          | Visualizations.R | 153         | Figure7_Timeline.pdf           |      |
| Figure 8          | Visualizations.R | 93          | Figure8_Correlationplot.pdf    |      |
| Table 2           | 7_evaluation_models.py| 1590   | evaluation_models.csv; bert_metrics  | Table manually put together from csv output.    |
| Figure 9          | Visualizations.R | 264         | Figure9_PredictionVSActual.pdf |      |
| Appendix Table 1  | 7_evaluation_models.py | 401            | evaluation_models.csv                                |      |
| Appendix Table 2  | 7_evaluation_models.py                 | 401            | evaluation_models.csv                           |      |
| Appendix Table 3  | 7_evaluation_models.py                 | 401            | evaluation_models.csv                                |      |
| Appendix Figure 1 | Visualizations.R | 309            | Appendix_Figure1.pdf           |      |

## References

Davies, Shawn, Therese Pettersson & Magnus Öberg (2022). Organized violence 1989-2021 and drone warfare. Journal of Peace Research 59(4).

Sundberg, Ralph and Erik Melander (2013) Introducing the UCDP Georeferenced Event Dataset. Journal of Peace Research 50(4).

ICG (International Crisis Group) (2022) CrisisWatch. <https://www.crisisgroup.org/crisiswatch>.

Norris, C., P. Schrodt, and J. Beieler. 2017. "PETRARCH2: Another Event Coding Program." The Journal of Open Source Software 2, no. 9 (January) <https://doi.org/10.21105/joss.00133>

------------------------------------------------------------------------

## Acknowledgements

The README follows the schema provided by the Social Science Data Editors' template at <https://social-science-data-editors.github.io/template_README/>.
